
[SHUFFLE] [WIP] Prototype: store shuffle file on external storage like S3 #34864

Closed
wants to merge 16 commits

Conversation

hiboyang

@hiboyang hiboyang commented Dec 10, 2021

What changes were proposed in this pull request?

This PR (design doc) adds support for storing shuffle files on external shuffle storage such as S3. It helps Dynamic Allocation on Kubernetes: the Spark driver can release idle executors without worrying about losing shuffle data, because the shuffle data is stored on external shuffle storage that is separate from the executors.

This could be viewed as a followup work for https://issues.apache.org/jira/browse/SPARK-25299.

There is a previous Worker Decommission feature (SPARK-33545), which is a great feature that copies shuffle data to fallback storage like S3. People appreciate that work for addressing the critical issue of handling shuffle data when Spark executors terminate. The work in this PR does not intend to replace that feature. The intent is to start further discussion about how to save shuffle data to S3 during normal execution time.

Why are the changes needed?

To better support Dynamic Allocation on Kubernetes, we need to decouple shuffle data from Spark executors. This PR implements another ShuffleManager and supports writing shuffle data to S3.

Does this PR introduce any user-facing change?

Yes, this PR adds the following two Spark configs to plug in the StarShuffleManager and store shuffle data in the provided S3 location.

spark.shuffle.manager=org.apache.spark.shuffle.StarShuffleManager
spark.shuffle.star.rootDir=s3://my_bucket_name/my_shuffle_folder
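
For illustration, the same settings could also be applied programmatically; a minimal Java sketch, where the bucket and folder names are placeholders:

import org.apache.spark.SparkConf;

public final class StarShuffleConfExample {
    public static void main(String[] args) {
        // Plug in the shuffle manager proposed in this PR and point it at an
        // S3 location (placeholder bucket/folder).
        SparkConf conf = new SparkConf()
            .setAppName("star-shuffle-example")
            .set("spark.shuffle.manager", "org.apache.spark.shuffle.StarShuffleManager")
            .set("spark.shuffle.star.rootDir", "s3://my_bucket_name/my_shuffle_folder");
        System.out.println(conf.toDebugString());
    }
}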

How was this patch tested?

Added a unit test for StarShuffleManager. Many classes are copied from Spark, so tests were not added for those classes. We will work with the community to get feedback first, then work on removing the code duplication.

@SparkQA

SparkQA commented Dec 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50556/

@SparkQA

SparkQA commented Dec 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50556/

@c21
Contributor

c21 commented Dec 11, 2021

@hiboyang, thanks for the work here! Could you create a design doc for this? That might help attract more people's attention and make it easier for them to understand.

@hiboyang
Author

@hiboyang, thanks for the work here! Could you create a design doc for this? That might help attract more people's attention and make it easier for them to understand.

Yes, good suggestion. Will create a design doc.

@SparkQA

SparkQA commented Dec 11, 2021

Test build #146081 has finished for PR 34864 at commit 4b67c8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class ByteBufUtils
  • public class StarBlockStoreClient extends BlockStoreClient
  • public class StarBypassMergeSortShuffleWriter<K, V> extends ShuffleWriter<K, V>
  • public class StarLocalFileShuffleFileManager implements StarShuffleFileManager
  • public class StarMapResultFileInfo
  • public class StarS3ShuffleFileManager implements StarShuffleFileManager
  • public static class S3BucketAndKey
  • public class StarUtils
  • public class StartFileSegmentWriter
  • class StarShuffleManager(conf: SparkConf) extends ShuffleManager with Logging
  • final class StarShuffleBlockFetcherIterator(
  • case class FetchRequest(
  • case class DeferFetchRequestResult(fetchRequest: FetchRequest) extends FetchResult

@linzebing
Contributor

Quickly glanced through the code; it seems that for writing shuffle data we write locally first and then upload to S3, and similarly for reading shuffle data we download to a local temp file first and then read from it.

We should be able to write/read directly to/from S3, right?

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, All. Thank you!

BTW, for the record, Apache Spark 3.1+ already stores its shuffle files into the external storage like S3 and reads back from it.

  • [SPARK-33545][CORE] Support Fallback Storage during Worker decommission (Apache Spark 3.1.0)
  • [SPARK-34142][CORE] Support Fallback Storage Cleanup during stopping SparkContext (Apache Spark 3.2.0)
  • [SPARK-37509][CORE] Improve Fallback Storage upload speed by avoiding S3 rate limiter (Apache Spark 3.3.0)

It would be great not to ignore the existing Spark feature and avoid over-claiming.

Dynamic allocation is the same. Apache Spark has been supporting Dynamic Allocation in K8s too.

@hiboyang
Author

Quickly glanced through the code; it seems that for writing shuffle data we write locally first and then upload to S3, and similarly for reading shuffle data we download to a local temp file first and then read from it.

We should be able to write/read directly to/from S3, right?

Thanks for looking! Yes, we should be able to write/read directly to/from S3. This PR is a prototype; the code and the performance of writing/reading shuffle data on S3 still need improvement.

@hiboyang
Author

Hi, All. Thank you!

BTW, for the record, Apache Spark 3.1+ already stores its shuffle files into the external storage like S3 and reads back from it.

  • [SPARK-33545][CORE] Support Fallback Storage during Worker decommission (Apache Spark 3.1.0)
  • [SPARK-34142][CORE] Support Fallback Storage Cleanup during stopping SparkContext (Apache Spark 3.2.0)
  • [SPARK-37509][CORE] Improve Fallback Storage upload speed by avoiding S3 rate limiter (Apache Spark 3.3.0)

It would be great not to ignore the existing Spark feature and avoid over-claiming.

Dynamic allocation is the same. Apache Spark has been supporting Dynamic Allocation in K8s too.

Right, Spark has shuffle tracking to support Dynamic Allocation on Kubernetes, but it does not work well when shuffle data is distributed across many executors (those executors cannot be released).

The work here (storing shuffle data on S3) does not conflict with the worker decommission feature. The eventual goal is to store shuffle data on S3 or other external storage directly. Before getting there, people can still use the worker decommission feature.

@dongjoon-hyun
Member

You are completely wrong because you already know the worker decommission feature.

but it will not work well when there is shuffle data distributed on many executors (those executors cannot be released).

You should mention this in the PR description explicitly instead of misleading the users.

The work here (storing shuffle data on S3) does not conflict with worker decommission feature. The eventual goal is to store shuffle data on S3 or other external storage directly.

@hiboyang
Author

Hi Dongjoon, there seem to be some misunderstandings here. I am writing a design doc for this PR; hopefully it will help clarify things and address your questions.

You are completely wrong because you already know the worker decommission feature.

but it will not work well when there is shuffle data distributed on many executors (those executors cannot be released).

You should mention this in the PR description explicitly instead of misleading the users.

The work here (storing shuffle data on S3) does not conflict with worker decommission feature. The eventual goal is to store shuffle data on S3 or other external storage directly.

@SparkQA

SparkQA commented Dec 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50620/

@SparkQA

SparkQA commented Dec 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50620/

@SparkQA

SparkQA commented Dec 14, 2021

Test build #146147 has finished for PR 34864 at commit 8222f38.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50775/

@SparkQA

SparkQA commented Dec 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50775/

@SparkQA

SparkQA commented Dec 17, 2021

Test build #146303 has finished for PR 34864 at commit 761fe2a.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hiboyang
Author

Added a design doc for this prototype.

Contributor

@steveloughran steveloughran left a comment


Some quick comments on the PR.

<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
Contributor


You should pull in spark-hadoop-cloud and so indirectly get its shaded full AWS SDK. Yes, it's big, but it guarantees a consistent set of its own dependencies (HTTP client, Jackson, etc.), and because it includes support for services like STS and S3 events, it lets you add new features with guaranteed consistency of AWS artifacts.

Author


Thanks for the suggestion! Yes, I was thinking of using that Hadoop library as well, but did not do it because I wanted to start small with this prototype. Switching to the Hadoop library sounds like a good idea.

ManagedBuffer managedBuffer = downloadFileWritableChannel.closeAndRead();
listener.onBlockFetchSuccess(blockIdStr, managedBuffer);
} catch (IOException e) {
throw new RuntimeException(String.format(
Contributor


Include the inner exception text in the message and supply the exception as the cause in the constructor.
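
For illustration, a minimal sketch of the requested change; the helper name and block-ID parameter are assumptions, not code from the PR:

import java.io.IOException;

final class ShuffleFetchErrors {
    // Keep the original IOException as the cause and surface its message,
    // so the stack trace and error text are not lost.
    static RuntimeException fetchFailure(String blockIdStr, IOException e) {
        return new RuntimeException(String.format(
            "Failed to fetch shuffle block %s: %s", blockIdStr, e.getMessage()), e);
    }
}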

NioManagedBuffer managedBuffer = new NioManagedBuffer(byteBuffer);
listener.onBlockFetchSuccess(blockIdStr, managedBuffer);
} catch (IOException e) {
throw new RuntimeException(String.format(
Contributor


Again, pass on inner exception details.

Author

@hiboyang hiboyang Jan 5, 2022


Yes, good catch! Will add inner exception!

import scala.collection.Iterator;

import javax.annotation.Nullable;
import java.io.*;
Contributor


A bit brittle against JVM releases adding new classes here.

Author


Good point, let me remove .* here.
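
For illustration, the wildcard could be replaced with explicit imports of only the classes actually used; the exact list below is an assumption, not the file's real import set:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;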

public static final String DEFAULT_AWS_REGION = Regions.US_WEST_2.getName();

private static TransferManager transferManager;
private static Object transferManagerLock = new Object();
Contributor


final

Author


yes!
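
For illustration, a sketch of the suggested change; the lazy initialization shown here is an assumption about the surrounding code, not the PR's exact logic:

import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

final class TransferManagerHolder {
    private static TransferManager transferManager;
    // final: the lock object is assigned once and never replaced
    private static final Object transferManagerLock = new Object();

    static TransferManager getOrCreate() {
        synchronized (transferManagerLock) {
            if (transferManager == null) {
                transferManager = TransferManagerBuilder.standard().build();
            }
            return transferManager;
        }
    }
}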

throw new RuntimeException(String.format(
"Failed to download shuffle file %s", s3Url));
} finally {
transferManager.shutdownNow();
Contributor


What if the transfer manager is reused?

Author


This was a code mistake. I should not shut down the transfer manager here. Will remove this.
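
For illustration, a sketch of a per-download path that leaves the shared TransferManager running; the method and parameter names are assumptions:

import java.io.File;

import com.amazonaws.services.s3.transfer.Download;
import com.amazonaws.services.s3.transfer.TransferManager;

final class S3ShuffleDownloads {
    // The shared TransferManager is reused across downloads, so it is not
    // shut down here; shut it down once when the shuffle manager stops.
    static void downloadTo(TransferManager transferManager,
                           String bucket, String key, File destination)
            throws InterruptedException {
        Download download = transferManager.download(bucket, key, destination);
        download.waitForCompletion();
    }
}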

@steveloughran
Contributor

Obviously I am biased, but I believe that rather than trying to use the AWS APIs yourself, you should just use the hadoop file system APIs and interact with S3 through the s3a connector.

For a high-performance upload of a local file, use FileSystem.copyFromLocalFile; in s3a on Hadoop 3.3.2 this uses the same transfer manager class as this PR does, but adds exception handling/mapping, encryption settings, and auditing. And the s3a integration tests verify all this works... By the time you get to use it here you can assume the S3 upload works, and it becomes a matter of linking it up to Spark.

As copyFromLocalFile is implemented for all filesystems, it means the component will also work with other stores including google cloud and azure abfs, even if they do not override the base method for a high-performance implementation -yet.

This also means that you could write tests for the feature using file:// as the destination store and include these in the spark module; if you design such tests to be overrideable to work with other file systems, they could be picked up and reused as the actual integration test suites in an external module.

And, because someone else owns the problem of the s3 connector binding, you get to avoid fielding support calls related to configuring of AWS endpoint, region, support for third-party s3 stores, qualifying AWS SDK updates, etc.

Accordingly, I would propose

Getting integration tests set up is inevitably going to be somewhat complicated. I can provide a bit of consultation there.
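
For illustration, a minimal Java sketch of the suggested approach through the Hadoop FileSystem API; the shuffle root URI and file names are placeholders, not the PR's actual layout:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class HadoopFsShuffleUpload {
    // Upload a local shuffle file through the Hadoop FileSystem API. With the
    // s3a connector this benefits from its upload path, retries and auditing,
    // and the same code also works against file://, abfs://, gs://, etc.
    static void upload(String shuffleRootDir, String localPath, String remoteName)
            throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(shuffleRootDir), conf);
        Path src = new Path(localPath);
        Path dst = new Path(shuffleRootDir, remoteName);
        // delSrc = false, overwrite = true
        fs.copyFromLocalFile(false, true, src, dst);
    }
}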

@hiboyang
Author

hiboyang commented Jan 5, 2022

Obviously I am biased, but I believe that rather than trying to use the AWS APIs yourself, you should just use the hadoop file system APIs and interact with S3 through the s3a connector.

For a high-performance upload of a local file, use FileSystem.copyFromLocalFile; in s3a on Hadoop 3.3.2 this uses the same transfer manager class as this PR does, but adds exception handling/mapping, encryption settings, and auditing. And the s3a integration tests verify all this works... By the time you get to use it here you can assume the S3 upload works, and it becomes a matter of linking it up to Spark.

As copyFromLocalFile is implemented for all filesystems, it means the component will also work with other stores including google cloud and azure abfs, even if they do not override the base method for a high-performance implementation -yet.

This also means that you could write tests for the feature using file:// as the destination store and include these in the spark module; if you design such tests to be overrideable to work with other file systems, they could be picked up and reused as the actual integration test suites in an external module.

And, because someone else owns the problem of the s3 connector binding, you get to avoid fielding support calls related to configuring of AWS endpoint, region, support for third-party s3 stores, qualifying AWS SDK updates, etc.

Accordingly, I would propose

Getting integration tests set up is inevitably going to be somewhat complicated. I can provide a bit of consultation there.

Yes, these are great suggestions! Thanks again! I will find time to make changes for this, and may also reach out to you for consultation when adding integration tests :)

@hiboyang hiboyang closed this Jan 5, 2022
@hiboyang hiboyang reopened this Jan 5, 2022
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 16, 2022
@github-actions github-actions bot closed this Apr 17, 2022
@pspoerri

pspoerri commented May 5, 2022

@hiboyang I looked at your work earlier this year and I wanted to let you know that I used it as a basis for a shuffle plugin. Ultimately I decided to rewrite the plugin from scratch (except the tests) and base it on the design of the Spark shuffle manager.

The code is available here: https://github.com/ibm/spark-s3-shuffle/. It acts as an external Spark plugin and can be loaded into Spark binary releases.

I'm open to contributing this work back to Apache Spark if there is any interest.

@dongjoon-hyun
Member

The Apache Spark community is open to any contribution, @pspoerri. You can make your own PR.
BTW, do you have any shareable results in terms of stability and performance, @pspoerri?

@hiboyang
Author

hiboyang commented May 6, 2022

Hi @pspoerri, great that you are working on this, and thanks for letting us know! I stopped working on my previous PR due to a change in work priorities, but I would still like to see people continue working in this area.

There is big value in storing Spark shuffle data on S3. It will save cost and also make Spark more resilient to disk errors.

In my previous experiments, shuffle data on S3 had much worse performance. A lot of optimizations are needed, e.g. S3 key prefix randomization to avoid S3 throttling and asynchronous S3 writes. I will be happy to hear your thoughts on this as well.
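
For illustration, a hypothetical sketch of key-prefix randomization; the key layout and names are assumptions, not the prototype's actual scheme:

import java.util.concurrent.ThreadLocalRandom;

final class ShuffleKeyLayout {
    // Spread shuffle objects over many key prefixes so request rates are not
    // concentrated on a single prefix, which S3 throttles.
    static String shuffleKey(String rootDir, int shuffleId, long mapId, int reduceId) {
        int prefix = ThreadLocalRandom.current().nextInt(256);
        // The chosen prefix must also be recorded (e.g. in the map status) so
        // that readers can locate the object later; omitted in this sketch.
        return String.format("%s/%02x/shuffle_%d_%d_%d.data",
                rootDir, prefix, shuffleId, mapId, reduceId);
    }
}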

@michaelbilow

+1, would be great to get this working and part of the Spark ecosystem.

@steveloughran
Contributor

@michaelbilow Hadoop s3a is on the v2 SDK; the com.amazonaws classes are not on the classpath and Amazon is slowly stopping support. You cannot, for example, use the lower-latency S3 Express stores with it.

Like I say: I think you would be better off using the Hadoop file system APIs to talk to S3. If there are aspects of S3 storage which aren't available through the API, or only available very inefficiently due to the effort to preserve the POSIX metaphor, then let's fix the API so that other stores can offer the same features and other apps can pick them up.

For example, here's our ongoing delete API for Iceberg and other manifest-based tables:
apache/hadoop#6726
It maps to S3 bulk delete calls, but there's scope to add it to other stores (we now actually want to add it as a page-size == 1 option for all filesystems, as it simplifies Iceberg integration).

@pspoerri

@steveloughran How do I call the Hadoop file system APIs from Spark? Can you point me to a package?
I agree with you that the Hadoop APIs are not ideal performance-wise, but they are great from a usability and portability perspective.

Another issue is that Hadoop wants to know the size of every file it wants to read. While this makes sense for formats like Parquet, where the footer is located in the last few bytes of the file, it does not make sense for shuffle, where you know the exact block/file you want to read.

@xleoken
Member

xleoken commented Sep 7, 2024

+1, would be great to get this working and part of the Spark ecosystem.
